03. Q-Learning Example

Q-Learning Example Walk-through

Q-Learning Tables

Q-Learning in its simplest form generally uses a table to keep track of all the discrete q-values calculated for each state-action pair. In this scenario, the algorithm is iteratively run on multiple episodes until the "best action" for each state converges to a consistent choice. The values may continue to change slightly as the algorithm continues to iterate, but the choice of actions, or policy, does not.

The table in this diagram consists of five possible states that may be observed and three possible actions that the agent can take in response.

Initial Conditions

The numbers of states and actions are fixed, but at first, the agent doesn’t know what action to take, so the table is initialized with zeros or, in some cases, with high values to encourage exploration (see a discussion on “optimistic initialization values” here). At the start of each episode, the agent is provided a starting observation, usually a random state.
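
As a concrete sketch of such a table (in Python with NumPy, which this lesson doesn't prescribe; the optimistic value of 1.0 is only illustrative):

    import numpy as np

    n_states, n_actions = 5, 3

    # Standard initialization: every state-action value starts at zero.
    q_table = np.zeros((n_states, n_actions))

    # "Optimistic" initialization (illustrative value): start every entry above any
    # reward the agent can actually earn, so untried actions keep looking attractive
    # until they have been tried and their values pulled down.
    optimistic_q_table = np.full((n_states, n_actions), 1.0)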

Exploration vs Exploitation

When the agent knows its state, it can look at the values for all the possible actions it could take and choose the highest value. This kind of move is exploitation of the information already available. But is that really the right move? What if the other values are lower just because the agent hasn’t had a chance to evaluate them? If that is the case, the agent might get “stuck” and completely miss the true optimal set of actions (the optimal policy). This is why exploration by the agent is important. A typical method in Q-Learning is to use the epsilon-Greedy exploration method:

\epsilon -Greedy

a_{t}=\begin{cases} a_{t}^{*} & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}

The larger the epsilon value, the more frequently the agent chooses a random action instead of the action with the highest q-value in its table, a_t^{*}. This improves the chances that a large part of the state-action space is explored and tested. Once convergence has been achieved, the trained agent can be run without the exploration method.
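
A minimal sketch of \epsilon -Greedy selection, assuming a NumPy Q-table indexed as q_table[state, action] (the function name and layout are illustrative, not from the lesson):

    import numpy as np

    def epsilon_greedy(q_table, state, epsilon):
        """Return a random action with probability epsilon, else the greedy action."""
        n_actions = q_table.shape[1]
        if np.random.random() < epsilon:
            return np.random.randint(n_actions)   # explore: uniform random action
        return int(np.argmax(q_table[state]))     # exploit: highest q-value in the row

    # Example: with epsilon = 0.1 the agent exploits about 90% of the time.
    q_table = np.zeros((5, 3))
    action = epsilon_greedy(q_table, state=1, epsilon=0.1)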

Example: Line-bot

Imagine an agent that moves along a line with only five discrete positions, [0,4]. The agent can move left, right or stay put. If the agent chooses to move left when at position 0 or right at position 4, the agent just remains in place. The goal state is position 3, but the agent doesn't know that and is going to learn the best policy for getting to the goal via the Q-Learning algorithm. The environment will provide a reward of -1 for all locations except the goal state. The episode ends when the goal is reached.
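
One way to picture this environment is as a step function that maps a position and an action to the next position, a reward, and a done flag. The sketch below is hypothetical; in particular, the reward at the goal is assumed to be 0, which the text leaves unstated.

    GOAL = 3
    LEFT, STAY, RIGHT = 0, 1, 2                 # illustrative action encoding

    def step(position, action):
        """One step of the line-bot environment: positions 0..4, goal at 3."""
        if action == LEFT:
            position = max(position - 1, 0)     # moving left at 0 keeps the agent at 0
        elif action == RIGHT:
            position = min(position + 1, 4)     # moving right at 4 keeps the agent at 4
        # STAY leaves the position unchanged.
        done = position == GOAL
        reward = 0.0 if done else -1.0          # goal reward of 0 is an assumption
        return position, reward, done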

For this example, the parameters are very simple:

  • Number of states = 5
  • Number of actions = 3
  • Q-table initialized to all zeros.
    This is an "optimistic initialization" that encourages exploration, because all the rewards are negative.
  • Learning rate of \alpha =0.2
  • Discount rate of \gamma =1.0
  • Exploration rate of \epsilon =0.0
    In other words, \epsilon -Greedy is not implemented
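
Putting the parameters above together, an end-to-end sketch of the training loop for this example might look like the following. This is not the lesson's reference implementation; the action encoding, the goal reward of 0, the episode count, and the step cap are assumptions made for illustration.

    import numpy as np

    ALPHA, GAMMA, EPSILON = 0.2, 1.0, 0.0     # learning rate, discount, exploration
    GOAL = 3
    q_table = np.zeros((5, 3))                # 5 positions x 3 actions: left, stay, right

    for episode in range(100):                # arbitrary number of episodes
        state = np.random.randint(5)          # random starting position
        steps = 0
        while state != GOAL and steps < 100:  # step cap is only a safeguard
            # epsilon-Greedy choice (with EPSILON = 0.0 this is always greedy).
            if np.random.random() < EPSILON:
                action = np.random.randint(3)
            else:
                action = int(np.argmax(q_table[state]))
            # Environment step: action 0 = left, 1 = stay, 2 = right, clipped to [0, 4].
            next_state = int(np.clip(state + (action - 1), 0, 4))
            reward = 0.0 if next_state == GOAL else -1.0
            # Q-Learning update (the equation recalled in the walk-through below).
            target = reward + GAMMA * np.max(q_table[next_state])
            q_table[state, action] = (1 - ALPHA) * q_table[state, action] + ALPHA * target
            state = next_state
            steps += 1

    print(np.argmax(q_table, axis=1))         # greedy action per position after training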

Episode 0, Time 0

At the start of a new episode, the Q-table and agent are initialized.

The agent receives an observation and a reward from the environment:
position=1, which is the "next" state, and reward=-1.0, which is based on the "last" action.
The agent now knows s_0, a_0, r_0, \text{and } s_1 and can update the Q-table value for Q(s_0, a_0).

Recall the Q-Learning equation from before.

{\displaystyle Q(s_{t},a_{t})\leftarrow (1-\alpha )\cdot \underbrace {Q(s_{t},a_{t})} _{\rm {old~value}}+\underbrace {\alpha } _{\rm {learning~rate}}\cdot \overbrace {{\bigg (}\underbrace {r_{t}} _{\rm {reward}}+\underbrace {\gamma } _{\rm {discount~factor}}\cdot \underbrace {\max _{a}Q(s_{t+1},a)} _{\rm {estimate~of~optimal~future~value}}{\bigg )}} ^{\rm {learned~value}}}

Substituting our known values:

{\displaystyle Q(s_{0},a_{0})\leftarrow (1-0.2)\cdot Q(s_{0},a_{0})+0.2\cdot {\bigg (}r_{0}+\max _{a}Q(s_{1},a){\bigg )}}

We can find the old value for Q(s_{0},a_{0}) by looking it up in the table for state s_{0}=1 and action a_{0}=\text{stay}, which gives a value of 0. To find the estimate of the optimal future value, \max _{a}Q(s_{1},a), we look at the entire row of actions for the next state, s_{1}=1, and choose the maximum value across all actions. They are all 0 right now, so the maximum is 0. Reducing the equation, we can now update Q(s_{0},a_{0}).

{\displaystyle Q(s_{0},a_{0})\leftarrow -0.2 }
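
A quick sanity check of that arithmetic in plain Python (the variable names are just illustrative):

    alpha, gamma = 0.2, 1.0
    old_q = 0.0        # Q(s_0, a_0): table entry for state 1, action "stay"
    reward = -1.0      # r_0
    max_next_q = 0.0   # max over the row for s_1 = 1, still all zeros

    new_q = (1 - alpha) * old_q + alpha * (reward + gamma * max_next_q)
    print(new_q)       # -0.2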

Episode 0, Time 1

At this step, an action must be chosen. The highest-value action for position 1 could be either "left" or "right", since their values are equal (both are still 0, while "stay" was just updated to -0.2). In this example, "right" is chosen at random.
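
Note that a plain argmax always returns the first index among ties, so a random tie-break has to be added explicitly. A small sketch (an implementation detail assumed here, not specified by the lesson):

    import numpy as np

    def greedy_with_random_tie_break(q_row):
        """Return the index of a maximal q-value, breaking ties uniformly at random."""
        best = np.flatnonzero(q_row == q_row.max())   # indices of all maximal entries
        return int(np.random.choice(best))

    # Row for position 1 after the first update, ordered [left, stay, right]:
    row = np.array([0.0, -0.2, 0.0])
    print(greedy_with_random_tie_break(row))          # 0 or 2, chosen at random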

The cycle continues with a new observation and reward from the environment.

Quiz 2: Line-bot Step

After the action is executed, the agent receives an observation and reward from the environment:
position=2, and reward=-1.0.
The agent now knows s_1, a_1,r_1,\text{and } s_2.

QUESTION:

What is the updated value for Q(s_1, a_1)? (round to the nearest 10th)

SOLUTION:

-0.2, since Q(s_1, a_1) \leftarrow (1-0.2)\cdot 0 + 0.2\cdot (-1 + 1.0\cdot 0) = -0.2

Quiz 3: Line-bot Q-value

Now assume that a number of episodes have been run and the Q-table includes the values shown below.
A new episode begins, as before. The environment gives an observation of position=1 and a reward of -1.0.

QUESTION:

What is the new value for Q(1,stay)? (round your answer to the nearest 10th)

SOLUTION:


Quiz 4: Epsilon-Greedy

So far, we've assumed that \epsilon -Greedy is not used. That makes the next action in the example above pretty straightforward: it's the highest-value action for state 1, which is "right".

For this question, assume the agent uses \epsilon -Greedy, with a value of \epsilon =0.5.

What is the probability that the agent will choose the action to move to the "right"?

SOLUTION: 2/3 (about 66.7%). With probability 1-\epsilon =0.5 the greedy action "right" is chosen directly, and with probability \epsilon =0.5 a random action is chosen, which lands on "right" one time in three: 0.5 + 0.5/3 = 2/3.